Abstract: This study evaluates the efficacy of regression-based machine learning
algorithms by assessing their performance with diverse metrics. We use the Breast
Cancer Wisconsin (Diagnostic) dataset and employ both the random forest and
gradient-boosting regression algorithms. In our performance analysis, we used key
metrics such as Mean Squared Error (MSE), R-squared, Mean Absolute Error (MAE),
and Coefficient of Determination (COD), supplemented by additional metrics.
The evaluation aimed to gauge the algorithms’ accuracy and predictive capabilities. 
For continuous target variables, the gradient-boosting regression model
outperformed the other models on the breast cancer dataset. With radius as the
target variable, it achieved the lowest MSE, 0.05, indicating small prediction
errors. Furthermore,
the R-squared value of 0.89 highlighted the model's ability to explain the variance in the 
data, affirming its robust predictive power. The Mean Absolute Error (MAE) of 0.14 
reinforced the model's accuracy in predicting continuous outcomes. Beyond these core 
metrics, the study incorporated additional measures to provide a comprehensive 
understanding of the algorithms’ performance. The findings underscore the potential of 
gradient-boosting regression in enhancing predictive accuracy for datasets with 
continuous target variables, particularly evident in the context of breast cancer 
diagnosis. This research contributes valuable insights to the ongoing exploration of 
machine learning algorithms, providing a basis for informed decision-making in 
medical and predictive analytics domains. 


Introduction 

Artificial intelligence is a branch of computer science
that encompasses machine learning and deep learning.
Driven by innovations in numerous domains, machine
learning has recently grown in importance as a topic of
study. The discipline is not without difficulties and
limitations, though, including the need for large amounts
of data, the possibility of biased algorithms, and the
difficulty of interpreting the behavior of complex models.
Addressing these issues and improving the state of the art
in machine learning are the main goals of ongoing
research. Machine learning comprises supervised,
semi-supervised, and unsupervised learning (Mao et al.,
2019). The supervised learning paradigm uses a collection
of paired input-output training samples to discover the
relationship between a system's input and output. Because
the output is viewed as the label, or supervision, of the
input, an input-output training sample is also referred to
as labelled training data (Verbraeken et al., 2020).
Supervised learning covers two kinds of tasks: regression
and classification.

Regression is a supervised machine learning technique
for predicting continuous values. The goal of the
regression process is to find the line or curve that best
fits the data. Regression models map the input domain
into a real-valued domain. Classification is the other
supervised learning technique, used to map the input to
predefined classes (Choi and Lim, 2020; Mishra et al.,
2004).

Regression comes in several types, which are discussed
below.
Simple Linear Regression 

Linear regression (Sudhaman et al., 2022) aims to
reveal the relationship between two variables. One
variable is treated as independent, while the other is
treated as dependent. Simple regression isolates the
influence of the independent variable on the dependent
variable. The linear regression shown in Eq. (1) is also
known as the population regression function.

$$s = \beta_0 + \beta_1 t + \varepsilon \tag{1}$$

where $\beta_0$ and $\beta_1$ are the intercept and slope
coefficients and $\varepsilon$ is the error term.
Multivariate Linear Regression 

Multivariate linear regression (Maulud and
Abdulazeez, 2020) is a supervised learning algorithm that
involves multiple independent input variables and one
dependent variable.

$$s = \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \dots + \beta_n t_n + \varepsilon \tag{2}$$

It is a technique for modelling the relationship between
several independent variables, $t_1$ to $t_n$, and a
dependent variable $s$, while assuming a linear
relationship. It can be used both to model and to predict
how different variables affect a dependent variable.
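
As a minimal sketch of Eq. (2), the coefficients can be estimated by ordinary least squares. The example below uses NumPy on synthetic data. The data, the random seed, and the helper name `fit_ols` are illustrative assumptions and are not part of the study.

```python
import numpy as np

def fit_ols(t, s):
    """Estimate [beta_0, beta_1, ..., beta_n] by ordinary least squares."""
    # Prepend a column of ones so that beta_0 acts as the intercept.
    design = np.column_stack([np.ones(len(t)), t])
    beta, *_ = np.linalg.lstsq(design, s, rcond=None)
    return beta

# Synthetic data (assumed for illustration): s = 1 + 2*t1 - 3*t2 + noise
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 2))
s = 1.0 + 2.0 * t[:, 0] - 3.0 * t[:, 1] + rng.normal(scale=0.1, size=200)

beta = fit_ols(t, s)
print(np.round(beta, 2))  # close to the true coefficients [1, 2, -3]
```

With low noise and 200 samples, the estimates should land close to the coefficients used to generate the data.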

When the relationship between the variables is 
expected to be approximately linear, multivariate 
regression is often used, whereas polynomial regression 
is used when the relationship is expected to be nonlinear. 
However, the method to use is ultimately determined by 
the specific problem and the nature of the data. 
Polynomial Linear Regression 

In this regression, the relationship between the
independent variable $t$ and the dependent variable $s$ is
modelled (Tabelini et al., 2021) as an $h$th-degree
polynomial in $t$. Polynomial regression (Jie and Zheng,
2019) can fit a nonlinear relationship between the value
of $t$ and the associated conditional mean of $s$.

$$s = \beta_0 + \beta_1 t + \beta_2 t^2 + \dots + \beta_h t^h + \varepsilon \tag{3}$$

where $h$ is the polynomial degree.
The analyst can determine the degree of the 
polynomial function based on the complexity of the 
relationship between the dependent and independent 
variables. A degree 2 polynomial, for example, would fit 
a quadratic relationship between the variables, whereas a 
degree 3 polynomial would fit a cubic relationship. 

During training, the polynomial regression model
employs an optimization algorithm to determine the
coefficient values that best fit the training data. Ordinary
least squares is the most commonly used method; it
minimizes the sum of the squared differences between
the dependent variable's actual and predicted values.

Polynomial regression is used to model complex
nonlinear relationships between variables in many fields,
including finance, engineering, and the social sciences. It
should, however, be used with caution because
higher-degree polynomials can overfit the training data,
resulting in poor generalization to new data. After
training, the model predicts new data by taking the values
of the independent variable(s) as input and computing the
corresponding predicted value of the dependent variable.
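
The effect of the polynomial degree can be illustrated with a short sketch. It fits polynomials of degree 1, 2, and 9 by least squares with `np.polyfit` and compares training and test error. The quadratic ground truth, the sample sizes, and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Assumed ground truth for illustration: a quadratic plus Gaussian noise.
    t = rng.uniform(-1.0, 1.0, size=n)
    s = 1.0 - 2.0 * t + 3.0 * t**2 + rng.normal(scale=0.2, size=n)
    return t, s

t_train, s_train = make_data(15)
t_test, s_test = make_data(200)

def mse(s_true, s_pred):
    return float(np.mean((s_true - s_pred) ** 2))

results = {}
for degree in (1, 2, 9):
    # np.polyfit solves the least-squares problem for the given degree.
    coeffs = np.polyfit(t_train, s_train, deg=degree)
    results[degree] = (
        mse(s_train, np.polyval(coeffs, t_train)),  # training error
        mse(s_test, np.polyval(coeffs, t_test)),    # test error
    )
    print(degree, results[degree])
```

The training error can only decrease as the degree grows, because each lower-degree model is a special case of the higher-degree one. The test error typically does not follow: a high-degree fit on few points tends to chase the noise.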

Logistic Regression 

Logistic regression models are commonly used to
investigate how various predictors affect categorical
outcomes. For binary outcomes, such as the presence or
absence of a disease, a binary logistic model is
appropriate. If the model includes just one predictor
variable, it is known as a simple logistic regression
model. If, on the other hand, the model involves multiple
predictors, such as categorical and continuous variables,
it is referred to as a multivariable logistic regression
(Khadhouri et al., 2022).

A logistic function (also known as the sigmoid
function) is used in the logistic regression model to
convert a linear combination of predictor variables into a
probability value between 0 and 1. The logistic regression
equation is as follows:

$$\pi = \frac{1}{1 + e^{-\eta}} \tag{4}$$

where $\pi$ is the predicted probability that the dependent
variable takes the value 1. The linear combination of the
predictor variables and their coefficients is denoted by
$\eta$, which can be written as:

$$\eta = \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \dots + \beta_n t_n \tag{5}$$

where $\beta_0$ is the intercept or bias term and
$\beta_1, \beta_2, \dots, \beta_n$ are the coefficients or
weights of the predictor variables $t_1, t_2, \dots, t_n$.

Once trained, the logistic regression model can be
used to predict new data by taking the values of the
predictor variables as input and computing the
corresponding probability that the dependent variable
takes the value 1. To convert the probability value into a
binary classification decision, a threshold value can be
set. The threshold is typically set to 0.5, but it can be
adjusted to achieve the desired balance of precision and
recall.
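
The prediction step can be sketched in a few lines of plain Python: compute the linear combination, pass it through the logistic function, and compare the result with a threshold. The coefficient values and the sample below are hypothetical and serve only to show the mechanics.

```python
import math

def sigmoid(eta):
    """Logistic function: maps any real eta to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def predict_proba(beta, t):
    """beta = [beta_0, beta_1, ..., beta_n]; t = [t_1, ..., t_n]."""
    eta = beta[0] + sum(b * x for b, x in zip(beta[1:], t))
    return sigmoid(eta)

def classify(beta, t, threshold=0.5):
    return 1 if predict_proba(beta, t) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
beta = [-1.0, 2.0, 0.5]
sample = [0.8, 0.4]              # eta = -1 + 1.6 + 0.2 = 0.8
p = predict_proba(beta, sample)
print(round(p, 3))                            # 0.69
print(classify(beta, sample))                 # 1 at the default threshold
print(classify(beta, sample, threshold=0.8))  # 0 with a stricter threshold
```

Raising the threshold trades recall for precision: fewer samples are labelled 1, but those that are carry a higher predicted probability.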

Ridge Regression 

Ridge regression is a type of regularized linear 
regression that is commonly used in machine learning 
and statistical modelling. It is employed when there are 
many predictor variables (sometimes referred to as 
features) in comparison to the number of observations or 
when the predictor variables have a high degree of 
correlation. 

Ridge regression adds a penalty term to the cost
function of ordinary least squares (OLS) regression,
which minimizes the squared difference between the
actual and predicted values. The added penalty term is
based on the L2-norm of the regression coefficients,
which encourages the coefficients to be smaller and helps
prevent overfitting of the model.

The ridge regression model is formulated as:

$$S = t\beta + \varepsilon \tag{6}$$

where $S$ is the dependent variable, $t$ represents the
predictor variable matrix, $\beta$ represents the vector of
regression coefficients, and $\varepsilon$ stands for the
error term. The OLS cost function is augmented with a
penalty term $\lambda\|\beta\|_2^2$, where $\lambda$ is a
hyperparameter that controls the strength of the penalty
and $\|\beta\|_2^2$ is the squared L2-norm of the
coefficient vector. Ridge regression was first proposed by
Arthur Hoerl and Robert Kennard (Hoerl and Kennard,
1970). Since then, it has become a popular tool for
dealing with high-dimensional data in a variety of fields,
including economics, finance, engineering, and
bioinformatics.
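
For the L2 penalty, the penalized least-squares problem has a closed-form solution, which the following sketch uses. It assumes centred data so that no intercept is needed. The helper name `fit_ridge`, the synthetic correlated predictors, and the penalty values are illustrative assumptions.

```python
import numpy as np

def fit_ridge(t, s, lam):
    """Closed-form ridge estimate: (t^T t + lam * I)^{-1} t^T s.

    Assumes t and s are centred, so no intercept is fitted or penalised.
    """
    n_features = t.shape[1]
    return np.linalg.solve(t.T @ t + lam * np.eye(n_features), t.T @ s)

# Synthetic, strongly correlated predictors (assumed for illustration).
rng = np.random.default_rng(2)
base = rng.normal(size=100)
t = np.column_stack([base, base + rng.normal(scale=0.01, size=100)])
t -= t.mean(axis=0)
s = t[:, 0] + t[:, 1] + rng.normal(scale=0.1, size=100)
s -= s.mean()

for lam in (0.0, 1.0, 100.0):
    beta = fit_ridge(t, s, lam)
    print(lam, np.round(beta, 3), round(float(np.linalg.norm(beta)), 3))
```

With `lam = 0.0` the estimate reduces to ordinary least squares, which is unstable when predictors are nearly collinear. Increasing `lam` shrinks the norm of the coefficient vector.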

Lasso regression 

Lasso regression is another type of regularized linear
regression that is used to address overfitting and perform
feature selection. The name stands for "Least Absolute
Shrinkage and Selection Operator" and was coined by
Tibshirani (1996). As in ridge regression, a penalty term
is added to the OLS cost function. However, instead of
the L2-norm of the coefficient vector, lasso uses the
L1-norm. This leads to a sparse solution in which some of
the coefficients are exactly zero, effectively performing
feature selection.

The lasso regression model is formulated as:

$$S = t\beta + \varepsilon \tag{7}$$

where $S$ and $t$ are the dependent variable and the
matrix of predictor variables, respectively, $\beta$ is the
vector of regression coefficients, and $\varepsilon$ is the
error term. The OLS cost function is augmented with a
penalty term $\lambda\|\beta\|_1$, where $\lambda$ is a
hyperparameter that controls the strength of the penalty
and $\|\beta\|_1$ is the L1-norm of the coefficient vector.

Lasso regression has found applications in various 
fields, including finance, genetics, and computer vision. 
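
The sparsity produced by the L1 penalty can be seen in a small sketch. It uses cyclic coordinate descent with soft-thresholding, one common way to solve the lasso problem. The scaling of the objective, the helper names, and the synthetic data are assumptions made for this illustration, and the data are centred so no intercept is fitted.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink x towards zero by lam; values within [-lam, lam] become 0."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def fit_lasso(t, s, lam, n_iter=200):
    """Minimise (1 / 2n) * ||s - t @ beta||^2 + lam * ||beta||_1.

    Plain cyclic coordinate descent; assumes centred data (no intercept).
    """
    n, p = t.shape
    beta = np.zeros(p)
    col_sq = (t ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial = s - t @ beta + t[:, j] * beta[j]
            rho = t[:, j] @ partial / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Synthetic data (assumed): only the first two of five features matter.
rng = np.random.default_rng(3)
t = rng.normal(size=(200, 5))
t -= t.mean(axis=0)
s = 3.0 * t[:, 0] - 2.0 * t[:, 1] + rng.normal(scale=0.5, size=200)
s -= s.mean()

beta = fit_lasso(t, s, lam=0.3)
print(np.round(beta, 2))
```

The coefficients of the three irrelevant features come out exactly zero, while the two relevant coefficients are shrunk towards zero but retained. A sufficiently large `lam` sets every coefficient to zero.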
Poisson Regression 

To model count data, a type of generalized linear
model (GLM) known as Poisson regression is often used.
This approach assumes that the response variable follows
a Poisson distribution (Joe and Zhu, 2005), with the
predictor variables affecting the distribution's mean.

The Poisson regression model can be expressed as:

$$\log(E(S \mid T)) = \beta_0 + \beta_1 T_1 + \beta_2 T_2 + \dots + \beta_k T_k \tag{8}$$

where $E(S \mid T)$ is the expected value of $S$ given $T$,
$S$ is the response variable, $T$ is a vector of predictor
variables, and $\beta$ is a vector of coefficients. The
natural logarithm (log) is the link function used in
Poisson regression, which guarantees that the predicted
values are not negative. The Poisson regression model is
frequently used to represent count data, such as the
number of events, occurrences, or observations in a
particular time or region, in disciplines including
epidemiology, biology, and the social sciences.
Stepwise Regression

Stepwise regression is an approach for choosing a
subset of predictor variables to include in a linear
regression model. Depending on the significance of each
variable, it can be carried out either forwards or
backwards, adding or eliminating variables one at a time.
The objective is to identify the most significant
predictors while avoiding overfitting. Using a criterion
such as the F-test or AIC, the forward stepwise regression
approach starts with an empty model and adds variables
one at a time based on their importance. The backward
stepwise regression approach starts with a complete
model and eliminates variables one at a time according to
their relevance.
Mathematically, the forward stepwise regression
(Chen et al., 2014) method can be expressed as follows:
1. Start with an empty model: $S = \beta_0 + \varepsilon$
2. For each predictor variable $T_i$, fit the model:
$S = \beta_0 + \beta_i T_i + \varepsilon$
3. Choose the variable $T_i$ that results in the highest
F-statistic or lowest AIC value
4. Add the variable $T_i$ to the model:
$S = \beta_0 + \beta_i T_i + \beta_j T_j + \varepsilon$
5. Repeat steps 2-4 until no variable can be added to
the model

The backward stepwise regression method can be
expressed as follows:
1. Begin with a full model:
$S = \beta_0 + \beta_1 T_1 + \beta_2 T_2 + \dots + \beta_k T_k + \varepsilon$
2. For each predictor variable $T_i$, fit a model without
that variable:
$S = \beta_0 + \beta_1 T_1 + \dots + \beta_{i-1} T_{i-1} + \beta_{i+1} T_{i+1} + \dots + \beta_k T_k + \varepsilon$
3. Select the variable $T_i$ with the lowest F-statistic, or
the one whose removal gives the lowest AIC value
4. Remove the variable $T_i$ from the model:
$S = \beta_0 + \beta_1 T_1 + \dots + \beta_{i-1} T_{i-1} + \beta_{i+1} T_{i+1} + \dots + \beta_k T_k + \varepsilon$
5. Repeat steps 2-4 until no variable can be removed
from the model.
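
The forward procedure can be sketched as follows, using AIC as the selection criterion. At each round the sketch refits the current model plus one candidate variable, which is one common reading of step 2. The Gaussian AIC formula (up to an additive constant), the function names, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def gaussian_aic(design, s):
    """AIC of an OLS fit, up to an additive constant: n*ln(RSS/n) + 2k."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, s, rcond=None)
    rss = float(np.sum((s - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(T, S):
    """Return the column indices selected by forward selection on AIC."""
    n, p = T.shape
    intercept = np.ones((n, 1))
    selected = []
    best_aic = gaussian_aic(intercept, S)          # step 1: empty model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # Step 2: fit one model per remaining candidate variable.
        scores = {
            j: gaussian_aic(np.hstack([intercept, T[:, selected + [j]]]), S)
            for j in candidates
        }
        best_j = min(scores, key=scores.get)       # step 3: lowest AIC
        if scores[best_j] >= best_aic:             # step 5: stop if no gain
            break
        selected.append(best_j)                    # step 4: add the variable
        best_aic = scores[best_j]
    return selected

# Synthetic data (assumed): S depends on columns 0 and 3 only.
rng = np.random.default_rng(4)
T = rng.normal(size=(300, 6))
S = 2.0 * T[:, 0] - 1.5 * T[:, 3] + rng.normal(scale=0.5, size=300)

selected = forward_stepwise(T, S)
print(selected)
```

The two informative columns are picked first, in order of effect size. Because AIC is a fairly lenient criterion, a noise column may occasionally be added afterwards, which is one of the known weaknesses of stepwise selection.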


Stepwise regression has limitations and underlying
assumptions that should be examined carefully before it
is used to choose significant predictors. It can be
employed either in addition to or in place of other
variable selection techniques such as regularization or
model averaging.
Multilevel Regression 

Multilevel regression is a statistical method for
analyzing data that has a hierarchical or nested structure,
such as students nested within schools, personnel nested
within groups, or patients nested within hospitals. It is
also known as hierarchical linear modelling or
mixed-effects modelling. By modelling the variation at
each level of the hierarchy and estimating the associations
between variables at each level, multilevel regression
takes the hierarchical structure of the data into account.

When examining data with nested structures,
multilevel regression is an effective method that can give
important insights into the relationships between
variables at different levels of the hierarchy (Bosker and
Snijders, 2012).
Quantile Regression 

Given the predictor variables, quantile regression
estimates the conditional quantile function of the
response variable. The conditional quantile function is
defined as:

$$Q_\tau(s \mid t) = \inf\{q : P(s \le q \mid t) \ge \tau\} \tag{9}$$

where $s$ is the response variable, $t$ is the predictor
variable(s), $\tau$ is the quantile of interest (e.g.,
$\tau = 0.5$ for the median), and $Q_\tau(s \mid t)$ is the
value of the response variable at the $\tau$th quantile
given the predictor variables.
The quantile regression (Geraci and Bottai, 2007)
estimator minimizes the following objective function:

$$\sum_i \left[\tau - I(s_i \le t_i\beta)\right] \rho(s_i - t_i\beta) \tag{10}$$

where $I(\cdot)$ is the indicator function, $\beta$ is a vector
of regression coefficients, and $\rho(u)$ controls how the
residuals are weighted. The function $\rho(u)$ can be
selected based on the required characteristics of the
estimator.
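
A short sketch can make the objective concrete. It takes the weighting function to be the identity, rho(u) = u, which gives the asymmetric absolute loss often called the pinball loss, and fits only a constant (an intercept-only model). That choice of rho, the synthetic sample, and the function names are assumptions for illustration.

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Objective with rho(u) = u: sum of [tau - I(u <= 0)] * u."""
    indicator = (residuals <= 0).astype(float)
    return float(np.sum((tau - indicator) * residuals))

# Synthetic sample (assumed); fit a constant, i.e. an intercept-only model.
rng = np.random.default_rng(5)
s = rng.exponential(scale=2.0, size=2001)

def best_constant(s, tau):
    # The loss is piecewise linear in q, so a minimiser lies at a data point.
    losses = [pinball_loss(s - q, tau) for q in s]
    return float(s[int(np.argmin(losses))])

for tau in (0.1, 0.5, 0.9):
    fitted = best_constant(s, tau)
    empirical = float(np.quantile(s, tau))
    print(tau, round(fitted, 3), round(empirical, 3))
```

The constant that minimizes the loss tracks the empirical quantile of the sample. For tau = 0.5 it is the sample median.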
Bayesian Regression 

Bayesian regression is a statistical technique for
modelling relationships between variables. Unlike
conventional regression techniques, Bayesian regression
offers a means to include prior knowledge or beliefs about
the variables in the model.

Assuming a linear regression model with a normally
distributed error structure, response variable $s$, and
predictor variable $t$, we can write:

$$s_i = \beta_0 + \beta_1 t_i + \varepsilon_i \tag{11}$$

where $s_i$ is the observed response for the $i$th
observation, $t_i$ is the corresponding predictor value,
$\beta_0$ and $\beta_1$ are the intercept and slope
coefficients to be estimated, and $\varepsilon_i$ are the
error terms, assumed to be normally distributed with
mean 0 and variance $\sigma^2$.

In Bayesian regression (Emami et al., 2018) we
specify prior distributions for the model parameters
$\beta_0$, $\beta_1$, and $\sigma^2$, and update them based
on the observed data using Bayes' theorem.

Specifically, we have:

$$p(\beta_0, \beta_1, \sigma^2 \mid s, t) \propto p(s \mid \beta_0, \beta_1, \sigma^2, t)\, p(\beta_0, \beta_1, \sigma^2) \tag{12}$$

where $p(s \mid \beta_0, \beta_1, \sigma^2, t)$ is the likelihood
function of the data, which specifies the probability of
observing the data given the model parameters, and
$p(\beta_0, \beta_1, \sigma^2)$ is the prior distribution of
the parameters. The proportionality sign reflects that the
normalizing constant, the marginal probability of the
data, is omitted.
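
The update from prior to posterior can be sketched for a simplified case in which the posterior is available in closed form. The sketch assumes that the noise variance is known and that both coefficients have independent zero-mean Gaussian priors. This is a simplification of the general setting, in which the variance also receives a prior, and all numeric values are illustrative.

```python
import numpy as np

def posterior(t, s, sigma2, prior_var):
    """Posterior mean and covariance of (beta_0, beta_1).

    Assumes a known noise variance sigma2 and independent zero-mean
    Gaussian priors with variance prior_var on both coefficients.
    """
    design = np.column_stack([np.ones_like(t), t])
    precision = design.T @ design / sigma2 + np.eye(2) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ design.T @ s / sigma2
    return mean, cov

# Synthetic data (assumed): s_i = 1 + 2 * t_i + epsilon_i, sigma^2 = 0.25
rng = np.random.default_rng(6)
t = rng.uniform(0.0, 1.0, size=50)
s = 1.0 + 2.0 * t + rng.normal(scale=0.5, size=50)

mean_few, cov_few = posterior(t[:5], s[:5], sigma2=0.25, prior_var=10.0)
mean_all, cov_all = posterior(t, s, sigma2=0.25, prior_var=10.0)
print(np.round(mean_few, 2), np.round(np.diag(cov_few), 3))
print(np.round(mean_all, 2), np.round(np.diag(cov_all), 3))
```

With more observations the posterior variances shrink, which is how the model expresses growing certainty about the coefficients.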
Metrics used in Regression 

The metrics commonly used to assess regression
models are explained below.

Mean Squared Error and Root Mean Squared Error

The MSE is the mean of the squared differences
between the actual and predicted values. A smaller MSE
(James et al., 2013) indicates a better fit of the model.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{13}$$

where $n$ is the number of observations, $y_i$ is the actual
value for observation $i$, and $\hat{y}_i$ is the
corresponding predicted value.

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \tag{14}$$

The RMSE (Wang and Lu, 2018), the square root of the
MSE, is expressed in the same units as the dependent
variable. Both metrics penalize large errors more heavily
than small errors.

Mean Absolute Error 

The MAE is the average absolute difference between
the predicted and actual values. Like the MSE and RMSE,
a lower MAE (De Myttenaere et al., 2016) indicates a
better fit of the model. Because it does not square the
errors, the MAE is less sensitive to outliers than the
MSE and RMSE.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{15}$$


R-squared (R²) and Adjusted R-squared

The R-squared (Gelman et al., 2019) value measures
how well the model accounts for the variation in the
dependent variable. Values between 0 and 1 indicate the
goodness of fit, with higher values suggesting a better
fit. The adjusted R-squared penalizes the model for having
too many variables and is useful for comparing models
with different numbers of predictors.

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{16}$$

where $SS_{res}$ is the sum of squares of the residuals (the
differences between the predicted and actual values), and
$SS_{tot}$ is the total sum of squares (the differences
between the actual values and the mean of the actual
values).

To account for the number of predictors in a model, a
modified version of R-squared known as adjusted
R-squared is often used:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \tag{17}$$

where $p$ is the number of predictors in the model.
Mean Absolute Percentage Error 

The MAPE is the mean of the absolute percentage
differences between the predicted and actual values
(Makridakis et al., 2018; De Myttenaere et al., 2016). It is
expressed as a percentage and is useful for evaluating
models in which the scale of the variable is important.
The MAPE is sensitive to small actual values and is
undefined if any actual value is zero.

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{18}$$

where $y_i$ is the actual value and $\hat{y}_i$ is the
predicted value for the $i$th observation.
Coefficient of Determination 

COD, which represents the square of the correlation
coefficient between the predicted and actual data, is a
measure of goodness of fit. A better fit is indicated by a
higher COD value, which ranges from 0 to 1. The COD is
commonly employed in fields such as banking and
economics, even though it is less readily interpretable
than R-squared (Chicco et al., 2021; Schober et al., 2018).

$$\mathrm{COD} = r^2 \tag{19}$$

where $r$ is the correlation coefficient.
Akaike Information Criterion and Bayesian Information Criterion

AIC and BIC are measures for comparing the quality of
a model with that of other models. These metrics consider
both the model's goodness of fit and its complexity. A
lower value of AIC (Vrieze, 2012) or BIC (Acquah et al.,
2010) indicates a better model, with BIC penalizing
additional parameters more heavily than AIC whenever
$\ln(n) > 2$.
They are calculated as follows:

$$\mathrm{AIC} = -2\ln(J) + 2r \tag{20}$$

$$\mathrm{BIC} = -2\ln(J) + r\ln(n) \tag{21}$$

In these formulas, $J$ is the likelihood of the data given
the model, $r$ is the number of parameters in the model,
and $n$ is the number of observations.

Mean Forecast Error 

The MFE is the average of the differences between
the actual and predicted values. In contrast to the other
metrics, a smaller MFE (De Myttenaere et al., 2016) is not
always better: because the sign of each error is retained,
positive and negative errors can cancel out, so the MFE
measures bias rather than overall accuracy.

$$\mathrm{MFE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$ (22)
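
The metrics defined in this section can be computed in a few lines of plain Python, as sketched below. The function name and the toy values are made up for illustration and are unrelated to the study's data.

```python
import math

def regression_metrics(actual, predicted, n_predictors):
    """Return MSE, RMSE, MAE, R2, adjusted R2, MAPE and MFE as a dict."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": r2,
        "Adj_R2": 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1),
        # MAPE is undefined when an actual value is zero.
        "MAPE": 100.0 / n * sum(abs(e / y) for e, y in zip(errors, actual)),
        "MFE": sum(errors) / n,
    }

# Toy values, made up for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(actual, predicted, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case every prediction is off by 0.5, so the MAE is 0.5, yet the MFE is exactly 0 because the over- and under-predictions cancel. This is why the MFE should be read alongside a magnitude-based metric.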


Methodology 

Several breast cancer datasets have been used in
research, some of which may be more accurate or
representative of real-world scenarios than others. A few
examples are given below.

The SEER Dataset: 

The National Cancer Institute's Surveillance,
Epidemiology, and End Results (SEER) program compiles
information on cancer patients in the United States. The
SEER dataset (Ahmed et al., 2023) contains data on patient
demographics, cancer stage and treatment, as well as
survival rates for people with breast cancer.

The TCGA Dataset: 

TCGA stands for The Cancer Genome Atlas, a program
that collects genomic data and clinical information for
multiple cancer types, including breast cancer. The TCGA
(Dehkharghanian et al., 2023) breast cancer dataset
includes information on gene expression, DNA mutations,
and clinical outcomes for breast cancer patients.

The METABRIC Dataset: 

METABRIC (Chen et al., 2023) stands for the Molecular
Taxonomy of Breast Cancer International Consortium. It
is a multi-centre study that collected gene expression
data, clinical information, and survival outcomes for
breast cancer patients. The dataset includes information
on over 2,000 patients with primary breast cancer.
The ICGC Dataset: 
The International Cancer Genome Consortium (ICGC)
is a collaborative project that aims to collect genomic
data and clinical information on multiple cancer types,
including breast cancer. The ICGC (He et al., 2023) breast
cancer dataset includes information on DNA mutations,
gene expression, and clinical outcomes for breast cancer
patients.

In our work, we use the Breast Cancer Wisconsin
dataset (Nemade et al., 2023), one of the most commonly
used breast cancer datasets. The version used here has
twelve attributes and 569 instances; other versions have
additional attributes or slightly different attribute
names. The attributes are id_number, diagnosis, radius,
texture, perimeter, area, smoothness, compactness,
concavity, concave points, symmetry, and fractal
dimension.

Following are the steps for model building: 
Preprocess the Data: 

Data preprocessing is a vital step in machine learning
because it can increase the precision and reliability of
the final model. The dataset consists of 569 instances and
12 attributes. Each step in preprocessing the Breast
Cancer Wisconsin (Diagnostic) dataset is explained
below.

Importing Dataset: 

Importing the dataset is the first stage. A dataset with 
569 instances and 12 columns is obtained from the UCI 
Machine Learning Repository. 

Splitting the Dataset into Labels and Features: 

A label (output variable) in the dataset shows whether 
the mass was malignant or benign, and features (input 
variables) in the dataset are measurements of various 
characteristics of breast mass samples. Features will be 
separated from the label before applying machine- 
learning algorithms. 

Handling Missing Values: 

It is essential to determine whether the dataset
contains any missing values. There are different
approaches to handling missing values. Here, instead of
dropping the rows that contain missing values, the
missing values are imputed with the mean, median, or
mode.
Encoding Categorical Data:

Some features are categorical, such as the diagnosis 
(M or B). These are encoded in numerical values before 
applying the machine-learning algorithm. One popular 
method for encoding categorical data is one-hot 
encoding, which creates a new column for each possible 
value of the categorical variable. 

The general architecture of the preprocessing and 
model building is shown in Figure 1. 

Training and Testing: 

Two sets—the training set and the testing set—are 
produced once the data has been preprocessed. The 
testing set is used to evaluate the machine learning 
model's performance, while the training set is used to 
train the model. Here, 80% of the data are used for 
training and 20% are used for assessment. 
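
The preprocessing and splitting steps described above can be sketched with NumPy alone. The small table below is made up and merely stands in for the real data. Mean imputation, the two one-hot columns, and the random seed are illustrative choices, not the exact procedure used in the study.

```python
import numpy as np

# Tiny made-up table standing in for the real data (values are not from
# the Wisconsin dataset): diagnosis label plus two numeric attributes.
diagnosis = np.array(["M", "B", "B", "M", "B", "M", "B", "B", "M", "B"])
features = np.array([
    [17.9, 10.4], [13.5, np.nan], [12.4, 15.7], [20.6, 17.8], [11.4, 20.4],
    [np.nan, 14.3], [13.0, 21.8], [9.5, 12.4], [19.8, 24.0], [12.0, 18.6],
])

# Handling missing values: replace each NaN with its column mean.
col_means = np.nanmean(features, axis=0)
imputed = np.where(np.isnan(features), col_means, features)

# Encoding categorical data: one-hot columns for the categories B and M.
categories = np.unique(diagnosis)                      # ['B', 'M']
one_hot = (diagnosis[:, None] == categories).astype(float)

# Training and testing: shuffle the row indices, then split 80% / 20%.
rng = np.random.default_rng(7)
order = rng.permutation(len(imputed))
n_train = int(0.8 * len(imputed))
train_idx, test_idx = order[:n_train], order[n_train:]

X_train, X_test = imputed[train_idx], imputed[test_idx]
print(X_train.shape, X_test.shape)
print(one_hot[:3])
```

Shuffling before the split avoids any ordering in the file leaking into the train and test sets. For a binary label such as the diagnosis, a single 0/1 column would carry the same information as the two one-hot columns.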

Model Building: 

The Wisconsin dataset is typically used for
classification tasks with machine learning algorithms
such as decision trees, random forests, support vector
machines, and neural networks. Here, however, regression
algorithms are used to predict continuous variables
(radius and area of the breast mass). These continuous
variables are included as features in the dataset and are
related to the malignancy of the mass. Using regression
algorithms, the radius or area of a breast mass is
predicted.

Performance Evaluation: 

Instead of using classification metrics like accuracy, 
precision, recall and F1 score, it is important to assess the 
performance of the regression model using suitable 
metrics like mean squared error, R-squared, mean 
absolute error, coefficient of determination and mean 
forecast error. 
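
The model-building and evaluation steps can be sketched with scikit-learn, as below. This is not the authors' pipeline. It assumes the copy of the Wisconsin (Diagnostic) data bundled with scikit-learn, takes the ten "mean" measurements, and predicts mean radius from the other nine. The exact predictors, any scaling, and the hyperparameters used in the study are not stated, so the numbers printed here will differ from the reported results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
names = list(data.feature_names)

# Assumption: use the ten "mean ..." columns and predict "mean radius"
# from the remaining nine. The paper does not state its exact predictors.
mean_idx = [i for i, name in enumerate(names) if name.startswith("mean ")]
target_idx = names.index("mean radius")
X = data.data[:, [i for i in mean_idx if i != target_idx]]
y = data.data[:, target_idx]

# 80% of the data for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for label, model in [
    ("Random Forest", RandomForestRegressor(random_state=0)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[label] = {
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
    }
    print(label, scores[label])
```

Because perimeter and area are almost deterministic functions of radius, keeping them among the predictors makes this task much easier than the one reported in the paper. Dropping them would give a harder and arguably more informative benchmark.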

Results & Discussion 

We conducted a regression analysis on the breast
cancer dataset using the different regression algorithms,
implemented in Python 3.9.4. The analysis was run on a
Dell XPS 13 laptop with an Intel Core i7-1165G7
processor and 16 GB of RAM. The target variables are
radius, perimeter, and compactness. In Table 2 the target
variable is radius; gradient boosting regression gives the
lowest MSE, as shown in Figure 2.
Table 2. Comparison between different regressions using radius as a target variable

Target Variable   Regression Type                MSE    R-squared   MAE    COD     MFE
Radius            Linear Regression              0.08   0.78        0.21   0.77    0.00
Radius            Ridge Regression               0.09   0.77        0.21   0.76   -0.001
Radius            Lasso Regression               0.13   0.66        0.27   0.65    0.00
Radius            Elastic Net                    0.11   0.71        0.25   0.70    0.00
Radius            Decision Tree Regression       0.15   0.60        0.30   0.59    0.00
Radius            Random Forest Regression       0.07   0.82        0.17   0.82    0.00
Radius            Gradient Boosting Regression   0.05   0.89        0.14   0.88    0.00

[Figure 1 flowchart: Preprocess the data (Splitting the Dataset, Handling Missing Values, Encoding Categorical Data), Training and Testing, Performance Evaluation]

Figure 1. General architecture of the preprocessing and model building.
Table 3. Comparison between different regressions using perimeter as a target variable.

Target Variable   Regression Type                MSE    R-squared   MAE    COD    MFE
Perimeter         Linear Regression              0.24   0.54        0.37   0.53   0.00
Perimeter         Ridge Regression               0.25   0.52        0.38   0.52   0.001
Perimeter         Lasso Regression               0.37   0.29        0.45   0.27   0.00
Perimeter         Elastic Net                    0.30   0.42        0.39   0.40   0.00
Perimeter         Decision Tree Regression       0.31   0.40        0.40   0.39   0.00
Perimeter         Random Forest Regression       0.15   0.66        0.27   0.65   0.00
Perimeter         Gradient Boosting Regression   0.11   0.73        0.23   0.72   0.00


Table 4. Comparison between different regressions using compactness as a target variable

Target Variable   Regression Type                MSE    R-squared   MAE    COD    MFE
Compactness       Linear Regression              0.28   0.46        0.41   0.45   0.00
Compactness       Ridge Regression               0.28   0.45        0.41   0.44   0.00
Compactness       Lasso Regression               0.38   0.30        0.46   0.28   0.00
Compactness       Elastic Net                    0.32   0.38        0.43   0.37   0.00
Compactness       Decision Tree Regression       0.44   0.17        0.50   0.15   0.00
Compactness       Random Forest Regression       n/a    0.63        0.32   0.62   0.00
Compactness       Gradient Boosting Regression   n/a    0.69        0.28   0.69   0.00

n/a: value illegible in the source.


Figure 2. Comparison of different regressions with MSE when the target
variable is radius (x-axis: regression types).

Figure 3. Comparison of different regressions with R-squared values
when the target variable is radius.

Figure 4. Comparison of different regressions with MSE when the target
variable is perimeter.

Figure 5. Comparison of different regressions with R-squared values
when the target variable is perimeter.


DOI: https://doi.org/10.52756/ijerr.2024.v38.012


Int. J. Exp. Res. Rev., Vol. 38: 132-146(2024) 


Figure 6. Comparison of different regressions with MSE when the target
variable is compactness (y-axis: MSE; x-axis: regression types).

Figure 7. Comparison of different regressions with MSE when the target
variable is compactness.


Conclusion 

We applied seven different regression models to the
breast cancer dataset, using several continuous variables
as the target variable. The Random Forest and Gradient
Boosting Regression models consistently outperformed
the other models in terms of mean squared error,
R-squared, and mean absolute error.

For example, when using ‘radius’ as the target
variable, the Random Forest Regression model achieved
an MSE of 0.07, R-squared of 0.82, and MAE of 0.17,
while the Gradient Boosting Regression model achieved
an MSE of 0.05, R-squared of 0.89, and MAE of 0.14. In
contrast, the other models achieved higher MSE and
lower R-squared values, indicating that they were not as
effective at capturing the underlying relationships
between the predictors and target variables.

We also calculated the Coefficient of Determination 
(Adj R-squared) to account for the number of predictors 
used in each model. This provided a more accurate 
measure of the model's performance, especially when 
comparing models with different numbers of predictors. 
Gradient Boosting Regression models consistently 
achieved higher Adj R-squared values across multiple 
target variables, indicating that they can better capture the 
variation in the data. 